Question 2¶

Author: Michal Kubina¶

1. Motivation for my query and method:¶

The common stereotype about the nursing profession, which some might even call a phenomenon, is that it is dominated by women. According to US data, 9 percent of nurses are men and 91 percent are women (source: https://www.fastaff.com/blog/male-nursing-statistics).

I would like to find out whether this ratio is also reflected in Google Images. My query is therefore "nurse". I will download 10000 pictures and analyze those that are in a valid format and contain at least one face. To answer the question, I will use face detection followed by gender detection on the detected faces, relying on a classifier specifically trained to detect faces.

I will count the men's faces and women's faces in each picture and then analyze these data. However, I should take into account that these pictures sometimes also show doctors, so male doctors could bias the final result. Patients may appear in the pictures as well, which is another issue to keep in mind when analyzing and interpreting the results.

2. Analysis¶

Initialize libraries:

In [1]:
from simple_image_download import simple_image_download as s
from tensorflow.keras.preprocessing import image
import os
from tensorflow.keras.models import load_model
import numpy as np
import pandas as pd
import plotly.express as px
import cv2
from PIL import UnidentifiedImageError
from tqdm.notebook import tqdm
import plotly.offline as pyo
pyo.init_notebook_mode()
import warnings
warnings.simplefilter("ignore", UserWarning)
import math

Download the models:

In [2]:
#!wget https://raw.githubusercontent.com/opencv/opencv/master/data/haarcascades/haarcascade_frontalface_default.xml
#!wget https://github.com/oarriaga/face_classification/raw/master/trained_models/gender_models/gender_mini_XCEPTION.21-0.95.hdf5

Define useful functions:

In [3]:
def download_images(query, number):
    """
    Downloads images based on query

    Parameters
    ----------
    query : string
        string with keyword
    number : integer
        number of images to download
    """
    response = s.simple_image_download() #initialize the downloader
    response.download(query, number) #download images for the query
    
def load_image_from_path(image_path, target_size=None, color_mode='rgb'):
    """
    Loads image from path

    Parameters
    ----------
    image_path : string
        path of the image
    target_size : tuple
        target size of the image
    color_mode : string
        rgb or grayscale
        
    Returns
    ---------- 
    loaded image
    """
    pil_image = image.load_img(image_path, target_size=target_size, color_mode=color_mode) #load image
    return image.img_to_array(pil_image) #return image in an array

def apply_offsets(face_coordinates, offsets):
    """
    Derived from https://github.com/oarriaga/face_classification/blob/
    b861d21b0e76ca5514cdeb5b56a689b7318584f4/src/utils/inference.py#L21
    """
    x, y, width, height = face_coordinates #get face coordinates
    x_off, y_off = offsets #get offset
    return (x - x_off, x + width + x_off, y - y_off, y + height + y_off)

def identify_gender(gender_classifier, faces, offsets, shape_gender):
    """
    Identifies the gender of each detected face

    Note: reads the grayscale image from the module-level variable
    ``gray_image`` set in the analysis loop below.

    Parameters
    ----------
    gender_classifier : keras model
        trained gender classifier
    faces : list
        face bounding boxes from the cascade classifier
    offsets : tuple
        (x, y) padding added around each face
    shape_gender : tuple
        input size expected by the gender classifier
        
    Returns
    ---------- 
    res_men : int
        number of men in image
    res_women : int
        number of women in image
    res_skipped : int
        number of faces skipped
    """
    res_men = 0 #initialize number of men faces
    res_women = 0 #initialize number of women faces
    res_skipped = 0 #initialize number of skipped faces
    labels = ['woman', 'man'] #initialize labels, in the order used by the model

    for face_coordinates in faces: # loop over the CascadeClassifier detections
        x1, x2, y1, y2 = apply_offsets(face_coordinates, offsets) # extend the bounding box
        face_img = gray_image[y1:y2, x1:x2] # crop out the face (gray_image is global)
        
        if face_img.shape[0] == 0 or face_img.shape[1] == 0: #skip the face if the crop is empty (offsets ran outside the image)
            res_skipped += 1
            continue
            
        face_img = cv2.resize(face_img, shape_gender) # resize to the classifier's input size
        face_img = face_img.astype('float32') / 255.0 # preprocess the image
        face_img = np.expand_dims(face_img, 0) # batch of one
        probas = gender_classifier.predict(face_img) #predict gender probabilities for the face
        result = labels[np.argmax(probas[0])] #get the label with the highest probability 
        
        if result == 'man': #count men
            res_men += 1
        else: #count women
            res_women += 1
            
    return res_men, res_women, res_skipped
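
The empty-crop check in `identify_gender` matters because `apply_offsets` can push the bounding box outside the image; NumPy then interprets a negative start index as counting from the end and returns an empty slice. A minimal sketch with a made-up face box near the image corner:

```python
import numpy as np

def apply_offsets(face_coordinates, offsets):
    x, y, width, height = face_coordinates
    x_off, y_off = offsets
    return (x - x_off, x + width + x_off, y - y_off, y + height + y_off)

gray = np.zeros((100, 100), dtype='uint8')  # dummy grayscale image

# a detection touching the top-left corner: the padding pushes the box out of bounds
x1, x2, y1, y2 = apply_offsets((5, 5, 20, 20), (10, 10))
print(x1, x2, y1, y2)      # -5 35 -5 35

crop = gray[y1:y2, x1:x2]  # -5 is read as index 95, so the slice collapses
print(crop.shape)          # (0, 0) -- exactly what the skip branch catches
```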

Download 10 000 images:

In [4]:
#query = "nurse"
#download_images(query,10000)

Analysis:

In [5]:
directory = os.fsencode("simple_images/nurse/") #encode the path to the image directory

res = {} #initialize dictionary 
res["men"] = [] #initialize array based on key word men
res["women"] = [] #initialize array based on key word women
res["skipped"] = 0 #get number of skipped images

face_classification = cv2.CascadeClassifier('haarcascade_frontalface_default.xml') #initialize face classifier 
gender_classifier = load_model('gender_mini_XCEPTION.21-0.95.hdf5', compile = False) #initialize gender classifier

offsets = (10, 10) #initialize offsets
shape_gender = gender_classifier.input_shape[1:3] #initialize shape_gender

counter_valid_images = 0 #initialize counter for valid images
counter_valid_images_faces = 0 #initialize counter for valid images with faces

for file in tqdm(os.listdir(directory)): #loop through all the files
    filename = os.fsdecode(file) #get filename of the file
    try: #try to load image and except on error
        pre_image = load_image_from_path(("simple_images/nurse/" + filename), color_mode='grayscale')
    except UnidentifiedImageError:
        continue
    gray_image = np.squeeze(pre_image).astype('uint8') #drop the channel axis and convert to uint8
    faces = face_classification.detectMultiScale(gray_image, 1.3, 5) #get faces info
    
    counter_valid_images += 1 #count number of valid images
    
    if len(faces) == 0: #skip images with no detected faces
        continue
            
    res_men, res_women, res_skipped = identify_gender(gender_classifier, faces, offsets, shape_gender) #get number of men, women and skipped faces
    
    res["men"].append(res_men)
    res["women"].append(res_women)

    if res_women != 0 or res_men != 0: #if at least one face was classified
        counter_valid_images_faces += 1 #add 1 to number of valid images with faces
        
    res["skipped"] += res_skipped #add number of skipped faces
  0%|          | 0/10000 [00:00<?, ?it/s]
2022-01-13 20:53:23.108442: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2022-01-13 20:53:23.108604: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz

The first warning is not a problem, and neither is the second; the latter appears because of the M1 processor that I use.

Histogram of the number of men's and women's faces per image:

In [6]:
df = pd.DataFrame(dict(
    sex = np.concatenate((["men"]*len(res["men"]), ["women"]*len(res["women"]))),
    data = np.concatenate((res["men"], res["women"]))))
fig = px.histogram(df, x="data", color="sex", barmode='overlay', opacity=0.4)
fig.update_layout(xaxis_range=[0.5,5])
fig.show()

A plot that visualizes the number of men's and women's faces in each image. Jitter is added by perturbing each point with a small amount of noise.

In [7]:
fig = px.scatter(x=res["women"] + np.random.normal(0, 0.1, len(res["women"])),
                 y=res["men"] + np.random.normal(0, 0.1, len(res["men"])),
                 opacity=0.1)
fig.update_layout(
    title="Men vs Women relationship",
    xaxis_title="Women",
    yaxis_title="Men")
fig.show()
In [8]:
print("Number of images to analyze: 10 000")
print("Number of valid pictures in the data set: " + str(counter_valid_images))
print("Number of valid pictures with faces in the data set: " + str(counter_valid_images_faces))
print("Number of women faces detected: " + str(sum(res["women"])))
print("Number of men faces detected: " + str(sum(res["men"])))
print("Number of all faces detected: " + str(sum(res["men"]) + sum(res["women"])))
print("Number of skipped faces due to an empty face crop: " + str(res["skipped"]))
print("Percentage of women across the faces: " + str(math.floor(sum(res["women"])/(sum(res["women"]) + sum(res["men"]))*100.0)) + " %")
Number of images to analyze: 10 000
Number of valid pictures in the data set: 8954
Number of valid pictures with faces in the data set: 6278
Number of women faces detected: 6282
Number of men faces detected: 4072
Number of all faces detected: 10354
Number of skipped faces due to an empty face crop: 117
Percentage of women across the faces: 60 %

3. Conclusion¶

Looking at the first graph, the histogram supports my hypothesis. Surprisingly, no picture ever contained more than four faces of one gender, though this might be due to the classifier settings. The red bars (women) are consistently higher than the men's bars, which suggests that women are over-represented in the Google Images results as well.

In the second graph, each point represents one picture: the x-axis shows the number of women in the picture and the y-axis the number of men. I added a small amount of normally distributed noise to each point to jitter the results. There is no clear linear relationship between the numbers of women and men in a picture. Most pictures contain just one face, either a woman's or a man's, although some contain several faces in various combinations. Surprisingly, not a single picture contains exactly one man and one woman.
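
The claim that no picture contains exactly one man and one woman can be checked directly from the collected per-image counts. A minimal sketch, using a toy stand-in for the `res` dictionary built in the analysis cell:

```python
# toy stand-in for the per-image counts collected in the analysis cell
res = {"men": [1, 0, 2, 1], "women": [0, 1, 1, 1]}

# number of pictures with exactly one man and exactly one woman
both_one = sum(1 for m, w in zip(res["men"], res["women"]) if m == 1 and w == 1)
print(both_one)  # 1 for this toy data; 0 for the real data above
```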

Looking at the hard data above, the percentage of women across the detected faces was 60 percent. This is somewhat surprising, as it does not correspond to the real-world ratio of roughly nine women to one man described earlier. One reason might be that doctors or patients appearing in the pictures bias the result. It is also possible that the algorithm behind the search engine is tweaked a little in order not to be gender-biased.
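
As a rough sanity check on the 60 percent figure, a normal-approximation confidence interval can be sketched, under the optimistic assumption that detected faces are independent draws; in reality faces co-occur within pictures, so the true uncertainty is somewhat larger:

```python
import math

women, men = 6282, 4072        # face counts from the output above
n = women + men
p = women / n                  # observed share of women

se = math.sqrt(p * (1 - p) / n)          # standard error of the proportion
low, high = p - 1.96 * se, p + 1.96 * se # 95 % Wald interval
print(f"{p:.3f} ({low:.3f}, {high:.3f})") # -> 0.607 (0.597, 0.616)
```

Even the upper end of this interval is far below the 91 percent reported for the US workforce, so the gap is not explained by sampling noise alone.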

For future work, it would be interesting to use, for example, object detection to filter for images from a hospital environment, i.e., pictures containing objects typical of hospitals. Moreover, since nurses usually wear clothes in specific colors, we could also add the most dominant colors of each image to our analysis.

To sum up, I believe that even Google Images shows a slight gender bias toward women, but it does not correspond to the true distribution based on the data from the United States of America.